The Network#
import numpy as np
import pickle
import networkx as nx
import pandas as pd
import holoviews as hv
import geoviews as gv
from colorcet import palette
import bokeh
import warnings
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
from bokeh.sampledata.us_states import data as us_states
from mpl_toolkits.basemap import Basemap as Basemap
from holoviews.operation.datashader import datashade, directly_connect_edges
from shapely.errors import ShapelyDeprecationWarning
warnings.filterwarnings("ignore", category=ShapelyDeprecationWarning)
hv.extension('bokeh')
The dataset used for this project was created from a larger one collected by researchers at ELTE, available on the Kooplex platform.
Kooplex#
The main task here was to extract the users whose main geographical location was within the United States of America, and then to extract the followers of these users, using SQL queries. Two tables were used: Twitter.dbo.user_location_cluster and Twitter.dbo.user_follower. The first is queried to get the geolocation of a Twitter user as latitude and longitude; the second is used to extract the followers of the previously selected users.
To select only Twitter users from the United States, the users needed to be bounded by a box approximating the area of the U.S.A., whose coordinates I got from this site.
| min latitude | max latitude | min longitude | max longitude |
|---|---|---|---|
| 24.9493 | 49.5904 | -125.0011 | -66.9326 |
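The original selection was done with SQL queries on Kooplex, which are not reproduced here. As a rough illustration of the same bounding-box filter, here is a minimal pandas sketch. The column names `lat` and `lon`, the helper `in_us_box`, and the toy rows are assumptions made for this example, not the schema of the real tables.

```python
import pandas as pd

# Bounding box from the table above, in degrees.
US_BOX = {
    "min_lat": 24.9493,
    "max_lat": 49.5904,
    "min_lon": -125.0011,
    "max_lon": -66.9326,
}


def in_us_box(df, lat_col="lat", lon_col="lon"):
    """Return the rows of df whose coordinates fall inside US_BOX (hypothetical helper)."""
    mask = (
        df[lat_col].between(US_BOX["min_lat"], US_BOX["max_lat"])
        & df[lon_col].between(US_BOX["min_lon"], US_BOX["max_lon"])
    )
    return df[mask]


# Toy data for illustration only: New York, Budapest, Los Angeles.
toy_users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "lat": [40.71, 47.50, 34.05],
    "lon": [-74.01, 19.04, -118.24],
})

us_users = in_us_box(toy_users)
print(us_users)
```

A rectangular box is only an approximation of the country: it also covers parts of southern Canada and northern Mexico, and it leaves out Alaska and Hawaii.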
After extracting the user_id-s and follower_user_id-s, I filtered the followers to contain only users from the U.S. as well. The number of users in the U.S.A. is 7198227 and the number of edges (follows) is 34913927, but I had to reduce the edges to bidirectional ones only, meaning cases where two users follow each other mutually. The number of these edges is 22201314 and the number of users in this network is 2795066. These are our final numbers.
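The reduction to bidirectional edges can be sketched in pandas as follows. This is not necessarily how the original filtering was done; the column names `from` and `to`, the helper `mutual_edges`, and the toy edge list are assumptions for illustration.

```python
import pandas as pd


def mutual_edges(edges, src="from", dst="to"):
    """Keep only the rows (a, b) for which the reverse row (b, a) also exists.

    Hypothetical helper; assumes one row per directed follow relation.
    """
    # Swap the two columns to obtain every edge in the opposite direction.
    reversed_edges = (
        edges[[src, dst]]
        .rename(columns={src: dst, dst: src})
        .drop_duplicates()
    )
    # An inner merge on both columns keeps exactly the pairs present in both directions.
    return edges.merge(reversed_edges, on=[src, dst], how="inner")


# Toy edge list: 1<->2 and 3<->4 are mutual; 1->3 and 2->3 are one-way.
toy_edges = pd.DataFrame({
    "from": [1, 2, 1, 3, 4, 2],
    "to":   [2, 1, 3, 4, 3, 3],
})

mutual = mutual_edges(toy_edges)

# Users that still take part in at least one mutual edge.
users_in_network = pd.unique(pd.concat([mutual["from"], mutual["to"]]))
print(mutual)
print(len(users_in_network))
```

In this sketch each mutual pair appears twice, once per direction, so the row count is twice the number of undirected connections.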
edges = pd.read_csv(r'C:\Users\dajka\Documents\Egyetem\MSC\III\dsdatasci\data/bidirectional_edges.csv')
user_df = pd.read_csv(r'C:\Users\dajka\Documents\Egyetem\MSC\III\dsdatasci\data/us_users_in_network.csv')
Interactive visualization with datashader#
The first one is an interactive visualization of the network; to make it more navigable, I limited the users to those with the most followers.
The code is partially from this site.
# Select the 150 most followed users
most_followed_df = user_df.sort_values('follower_number', ascending=False).iloc[:150,]
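# Restrict the points to a latitude/longitude window around the US
# (the selection keys must match the kdims 'lon' and 'lat')
user_points = gv.Points(most_followed_df, kdims=['lon', 'lat'])\
    .select(lat=(20, 70), lon=(-175, -50))
# Keep only the edges whose two endpoints are both among the selected users
routes = edges[edges.iloc[:, 0].isin(user_points.data.user_id) &
               edges['to'].isin(user_points.data.user_id)]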
# Project the points from longitude/latitude to Web Mercator to match the map tiles
user_points = gv.operation.project_points(user_points)
# Declare nodes, graph and tiles
nodes = hv.Nodes(user_points.data, kdims=['lon', 'lat', 'user_id'],
                 vdims=['follower_number'])
graph = hv.Graph((routes, nodes), kdims=['from', 'to'], vdims=['from', 'to'])
tiles = gv.WMTS('https://maps.wikimedia.org/osm-intl/{Z}/{X}/{Y}@2x.png')
%%opts RGB () Graph [width=800 height=800] (edge_selection_line_color='black' edge_hover_line_color='red')
%%opts Graph (node_size=8 edge_line_alpha=0 edge_hover_line_alpha=1 edge_selection_line_alpha=1 edge_nonselection_line_alpha=0)
tiles * datashade(directly_connect_edges(graph), cmap=palette.bgy, width=800, height=800) * graph
Plot the network with Basemap#
Another visualization shows the Twitter users from the U.S. using networkx and Basemap, both Python libraries. The code is from: https://tuangauss.github.io/projects/networkx_basemap/networkx_basemap.html
